Computer method and apparatus for classifying objects

ABSTRACT

A computer classification method and apparatus employs statistical analysis of known objects in the class of interest. For each known object in the class, a respective vector of q bits is formed. Each bit indicates presence or absence of an activity or physical property in the object. The probability that a bit is equal to 1 in the class is then applied to vector representations of test objects and determines probability of the test object belonging to the class.

RELATED APPLICATION

[0001] This application is a continuation of PCT/US01/44000, filed Nov. 6, 2001 and claims the benefit of U.S. Provisional Application No. 60/246,196, filed Nov. 6, 2000, the entire teachings of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

[0002] In this age of information, the development of objective and automated methods for information synthesis is crucial to the productive use of the information. In particular, in the post genomic age when masses of information about genes and the proteins for which they code are being developed, there is a great need for methods by which this information can be reliably synthesized to produce knowledge.

SUMMARY OF THE INVENTION

[0003] In the present method, given a collection of similar objects, some of which possess an activity, some of which lack it and the rest of which are unclassified, the active and inactive sets are used to generate a profile which can be used to classify the unclassified objects and also to identify features that are significantly correlated and anti-correlated with activity. The method employs Bayesian statistics and a binary representation of objects in order to generate a profile of the active class. By employing standard statistical techniques in a novel manner, the method is also able to provide a probability that the classification of a specific object is accurate.

BRIEF DESCRIPTION OF THE DRAWINGS

[0004] The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.

[0005] FIG. 1 is a block diagram of a computer system embodying the present invention.

[0006] FIGS. 2a-2c are schematic illustrations of a preferred embodiment of the invention software executed in the computer system of FIG. 1.

[0007] FIGS. 3a-3b are significant feature charts output for the amino acid sequence in osteogenic proteins in the system of FIG. 1.

[0008] FIGS. 4a-4e are significant feature charts output for the amino acid sequence in osteogenic proteins in the system of FIG. 1.

[0009] FIG. 5 is the mathematical expectation value of a binary distribution given a small sample.

[0010] FIG. 6 is a plot of probability versus normalized score classifying osteogenic BMPs.

[0011] FIG. 7 is a plot of probability versus normalized score classifying osteogenic BMPs.

DETAILED DESCRIPTION OF THE INVENTION

[0012] The present invention provides a method and apparatus for classifying objects given a collection or set of objects known to be similar to each other. In particular, the invention method and apparatus classifies polypeptides given a collection of known proteins (i.e., known to be similar to each other within the set).

[0013] Illustrated in FIG. 1 is the present invention (software program 15) as implemented in a computer system 19. A digital processor 11 executes software program 15 in working memory. Software program 15 receives input 13 from another program, another computer (across a local network or through a communications link to an external network, e.g. the Internet), input device (mouse, keyboard, etc.) or the like. In response to the input, invention system 15 determines whether or not the input is a member of a predefined class. Output 17 from software program 15 is provided to another program, computer, database, or output device (e.g., display monitor) and/or the like.

[0014] In the preferred embodiment, software program 15 is formulated as follows and illustrated in FIGS. 2a-2c.

[0015] The Core Paradigm

[0016] The method can be used with any system that fits the following core paradigm. Each object 21 within a collection of M similar objects comprises N components (C) 25 wherein there exists a unique correlation between component k in object i and component k in object j: C_(ik)˜C_(jk). Thus a collection of M objects 21 can be represented as a matrix having M rows representing the M objects 21 and N columns representing the N components 25. Each cell in the matrix 23 is either empty or contains one of a set of elements 27 standard to that component 25. The elements 27 are represented as binary vectors 29 of features where each of the Q_(i) bits corresponds to a particular feature, a “1” indicating the presence of that feature and a “0” indicating the lack of that feature. Furthermore, it is required that objects 21 within the collection can be partitioned into three sets: one possessing a particular activity (the active training set), one lacking that activity (the inactive training set), and one where the activity is yet to be determined (the test set) as illustrated in FIG. 2b.

[0017] Feature Vectors

[0018] Each of the standard elements 27 within a component 25 is represented by a set of Q_(i) features. An element either possesses a particular feature or lacks it. Where the natural representation of a feature is a quantitative value, some cutoff value must be chosen below which the feature is judged to be absent (=0). The specific features chosen to represent elements 27 and the cutoff values determining the presence or absence of various features must be chosen such that each of the standard set of elements 27 has a unique binary vector representation, i.e., such that within the standard element set for a component no two feature vectors 29 are equal. If there are T_(i) standard elements in the i^(th) component, then a feature table 31 is a matrix of “1”s and “0”s having T_(i) rows and Q_(i) columns, where row h is the feature vector for element h. The collection matrix can then be treated as an M×N matrix of 1's and 0's where the number of columns, N=ΣQ_(i) and where one significant row T_(i) (feature vector 29) represents the i^(th) component 25. An object “descriptor” 33 is then a string of N bits as illustrated in FIG. 2b.
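The construction of an object descriptor from per-component feature tables can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the feature table, its element names and bit values, and the function names `check_unique` and `descriptor` are all hypothetical, and encoding an empty cell as all zeros is an assumption the text does not make.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical feature table for one component: element -> Q-bit feature vector.
# The element names and bit values are invented for illustration only.
FEATURE_TABLE: Dict[str, List[int]] = {
    "x": [1, 0, 0],
    "y": [0, 1, 0],
    "z": [1, 1, 1],
}


def check_unique(table: Dict[str, List[int]]) -> bool:
    """Return True if no two elements of a component share a feature vector."""
    vectors = [tuple(v) for v in table.values()]
    return len(vectors) == len(set(vectors))


def descriptor(
    obj: Sequence[Optional[str]],
    tables: Sequence[Dict[str, List[int]]],
) -> List[int]:
    """Concatenate the feature vector of each component's element into one bit list.

    An empty cell (None) is encoded as all zeros; this is an assumption of the
    sketch, since the text does not say how empty cells are represented.
    """
    bits: List[int] = []
    for element, table in zip(obj, tables):
        width = len(next(iter(table.values())))
        bits.extend(table[element] if element is not None else [0] * width)
    return bits
```

With three components that all share the table above, `descriptor(["x", "z", None], [FEATURE_TABLE] * 3)` yields a 9-bit descriptor, matching N=ΣQ_(i) with Q_(i)=3 for each component.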

[0019] Using Bayesian Log Odds to Construct Classification Profiles

[0020] Bayesian statistics deals with conditional probabilities and empirical logic. If set A is a subset of set B, then one can say that if an element is a member of set A it is also a member of set B, or that the probability that an element is a member of set B given that it is a member of set A, p(B|A), is 1. Suppose that set A is not a subset of set B, but only intersects B, i.e., p(B|A)<1, and one wants to know what the probability is of an element being in both sets A and B, p(AB). If one knows the probability of an element being in A, p(A), and the probability of an element being in B given that it is in A, then

p(AB)=p(B|A) p(A)=p(A|B) p(B).   (Eq. 1)

[0021] By the same reasoning, if one knows the probability of an element being in B, p(B), and the probability of an element being in A given that it is in B, p(A|B), then one can again calculate the probability of an element being in both sets. From Eq. 1, one can express one conditional probability in terms of the other:

p(A|B)=p(B|A) p(A)/p(B).   (Eq. 2)

[0022] Suppose there are three intersecting sets A, B and C. Then by the same line of reasoning $p(AB \mid C) = p(A \mid BC)\, p(BC)/p(C) = p(A \mid BC)\, p(B \mid C)$

[0023] which can be extended to four intersecting sets as $p(ABCD) = p(ABC \mid D)\, p(D) = p(AB \mid CD)\, p(C \mid D)\, p(D) = p(A \mid BCD)\, p(B \mid CD)\, p(C \mid D)\, p(D)$

[0024] From this then follows the general chain rule for multiple sets,

p(b₁ . . . b_(n)|A)=p(b_(n)|b₁ . . . b_(n−1),A) p(b_(n−1)|b₁ . . . b_(n−2),A) . . . p(b₁|A)=Πp(b_(i)|b₁ . . . b_(i−1),A), i=1→n.   (Eq. 3)

[0025] If events b₁ and b₂ are independent, then the state of b₁ is not affected by the state of b₂ so that

p(b ₁ |b ₂)=p(b ₁).

[0026] Thus if the set of states {b_(i)} are all independent, then

p(b₁ . . . b_(n)|A)=Πp(b_(i)|A), i=1→n.   (Eq. 4)

[0027] Two fundamental assumptions in this method are that the state of the i^(th) component 25 is independent of the state of the j^(th) component 25

p(C _(i) |C _(j))=p(C _(i))   (Eq. 5)

[0028] and that within a component, feature bits are also independent

p(b _(ij) |b _(jk))=p(b _(ij)).   (Eq. 6)

[0029] What we are interested in here is the probability that an object 21 is active or inactive given the state of its description in bits, p(A|{b_(i)}) and p(I|{b_(i)}). What we know, however, are different descriptions of active and inactive objects 21. The data then allows us to evaluate p([b_(i)=1]|A), p([b_(i)=0]|A), p([b_(i)=1]|I) and p([b_(i)=0]|I). Bayes' rule says that

p(A|{b_(i)})=p({b_(i)}|A) p(A)/p({b_(i)}), and

p(I|{b_(i)})=p({b_(i)}|I) p(I)/p({b_(i)}).

[0030] By equation 4,

p({b _(i) }|A)=Πp(b _(i) |A), i=1→N, and

p({b _(i) }|I)=Πp(b _(i) |I), i=1→N.

[0031] Then

p(A|{b _(i)})=Πp(b _(i) |A) p(A)/p({b _(i)}), and   (Eq. 7a)

p(I|{b _(i)})=Πp(b _(i) |I) p(I)/p({b _(i)}).   (Eq. 7b)

[0032] The odds ratio is then $\frac{p(A \mid \{b_i\})}{p(I \mid \{b_i\})} = \frac{\prod p(b_i \mid A)\, p(A)/p(\{b_i\})}{\prod p(b_i \mid I)\, p(I)/p(\{b_i\})} = \left[ p(A)/p(I) \right] \prod \left[ p(b_i \mid A)/p(b_i \mid I) \right]. \quad (\text{Eq. 8})$

[0033] It is preferable to express profile values as log odds ratios, in part because it is easier to express very small numbers as logs, and because scores can be accumulated as sums rather than products. There are two terms for each bit in the profile:

LO(1)_(i)=log [p(b _(i)=1|A)/p(b _(i)=1|I)], and   (Eq. 9A)

LO(0)_(i)=log [p(b _(i)=0|A)/p(b _(i)=0|I)]  (Eq. 9B)

[0034] A profile is then the set of paired values

P(1)_(i)=p([b_(i)=1]|A)*LO(1)_(i) and

P(0)_(i)=p([b_(i)=0]|A)*LO(0)_(i)

[0035] for each bit in the object description 33. The two major advantages of using the odds ratio to construct the profile are that first, it is based on the contrast between the active and inactive classes, and second, one does not have to deal with the prior distribution of the bits, p({b_(i)}). Multiplying the log odds by the respective active probability orders the values such that feature conservation within the active class is enhanced.
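The profile terms of Eqs. 9A and 9B and the paired values P(1) and P(0) can be sketched as follows. This is a minimal illustration, not the inventors' code: the function names are hypothetical, the natural logarithm is an assumption (the text does not fix the base), and the inputs are assumed to be probabilities strictly between 0 and 1.

```python
import math
from typing import List, Sequence, Tuple


def profile_entry(p1_active: float, p1_inactive: float) -> Tuple[float, float]:
    """Return (P(1), P(0)) for one bit.

    p1_active and p1_inactive are the probabilities that the bit equals 1 in the
    active and inactive classes; both must lie strictly between 0 and 1.
    """
    p0_active = 1.0 - p1_active
    p0_inactive = 1.0 - p1_inactive
    lo1 = math.log(p1_active / p1_inactive)  # Eq. 9A
    lo0 = math.log(p0_active / p0_inactive)  # Eq. 9B
    # Weight each log odds by the corresponding probability in the active class.
    return p1_active * lo1, p0_active * lo0


def build_profile(
    p_active: Sequence[float], p_inactive: Sequence[float]
) -> List[Tuple[float, float]]:
    """Build the list of (P(1), P(0)) pairs, one pair per bit of the descriptor."""
    return [profile_entry(a, i) for a, i in zip(p_active, p_inactive)]
```

A bit that is equally frequent in both classes contributes (0, 0), while a bit that is common in the active class and rare in the inactive class gets a positive P(1) and a negative P(0).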

[0036] Estimating Population Distributions from Small Samples

[0037] Although an unbiased estimator, the sample mean is generally not a good estimate of the population distribution, especially in the limit of small samples. If five white balls are selected from a vase containing some unknown distribution of 1000 black and white balls, it would be unreasonable to postulate that based on the draw of 5 white balls there are no black balls in the vase because the observed sample is so small relative to the size of the population. Furthermore, probability estimates of zero are a major problem in calculations such as that in equations 7 and 8 because one zero probability sends the entire expression to zero. Put another way, while it is reasonable to have small probabilities, it is unreasonable to have zero probabilities. What we want to know is given the sample, what is the expectation value of the population distribution? Given any value for the population distribution one can calculate the probability of observing the sample

p(w,b)=[(w+b)!/w!/b!]p ₀ ^(w)(1−p ₀)^(b),   (Eq. 10)

[0038] where p₀ is the population distribution of white balls. The expectation value for p₀ given the observed sample is then $E(p_0 \mid w, b) = \frac{\int_0^1 p_0 \left[ \frac{(w+b)!}{w!\, b!} \right] p_0^{w} (1 - p_0)^{b}\, dp_0}{\int_0^1 \left[ \frac{(w+b)!}{w!\, b!} \right] p_0^{w} (1 - p_0)^{b}\, dp_0} \quad (\text{Eq. 11})$

[0039] This expression is worked out in FIG. 5 with the result that

E(p ₀ |w,b)=(w+1)/(w+b+2).   (Eq. 12)

[0040] Thus for the sample of five white balls, E(p₀)=6/7.
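The result of Eq. 12 can be checked with exact rational arithmetic. The sketch below relies on the standard identity that the integral of p^a (1−p)^b over [0, 1] equals a! b!/(a+b+1)! for non-negative integers a and b; the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a: int, b: int) -> Fraction:
    """Exact integral of p**a * (1 - p)**b over [0, 1], i.e. a! b! / (a + b + 1)!."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def expected_p0(w: int, b: int) -> Fraction:
    """Evaluate Eq. 11 exactly for w white and b black balls.

    The binomial coefficient appears in both numerator and denominator and
    cancels, leaving a ratio of two Beta-type integrals.
    """
    return beta_integral(w + 1, b) / beta_integral(w, b)
```

For the sample of five white balls and no black balls, `expected_p0(5, 0)` evaluates to `Fraction(6, 7)`, in agreement with Eq. 12.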

[0041] In order to calculate odds ratios for 1's and 0's at each bit in the profile, it is then necessary to estimate the population frequency of 1's and 0's at that bit. By equation 12

p ₀(A,b _(ij)=1)=(n _(A)(1)(i,j)+1)/(N _(A)+2), and   (Eq. 13A)

p ₀(I,b _(ij)=1)=(n _(I)(1)(i,j)+1)/(N _(I)+2)   (Eq. 13B)

[0042] where b_(ij) is the j^(th) bit for the element vector for the i^(th) component, n_(A)(1)(i,j) and n_(I)(1)(i,j) are the number of 1's at bit j of component i in the active and inactive sets, respectively, and N_(A) and N_(I) are the number of objects, respectively, in the active and inactive sets.
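Applied to a training set, Eqs. 13A and 13B amount to counting the 1's in each bit column and applying the (count+1)/(n+2) estimator. A minimal sketch, assuming each training object is already encoded as a list of bits of equal length; the function name and the toy data in the usage note are hypothetical:

```python
from fractions import Fraction
from typing import List, Sequence


def bit_frequencies(descriptors: Sequence[Sequence[int]]) -> List[Fraction]:
    """Estimate the population frequency of a 1 at each bit as (count + 1) / (n + 2).

    descriptors holds one equal-length bit sequence per training object.
    """
    n = len(descriptors)
    width = len(descriptors[0])
    return [
        Fraction(sum(d[k] for d in descriptors) + 1, n + 2)
        for k in range(width)
    ]
```

For a toy active set `[[1, 0, 1], [1, 0, 0], [1, 1, 0]]` the estimates are 4/5, 2/5 and 2/5. No estimate is ever exactly 0 or 1, so the log odds of Eqs. 9A and 9B remain finite.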

[0043] One of the major advantages of using binary vector representations of component elements is that estimation is simplified because the alphabet size is 2. If one were to estimate population frequencies from the observed frequency of the component elements themselves, the likelihood is that the alphabet size, the number of elements in the standard set for the component, would exceed the number of objects in the training set. If there are N_(A) objects in the active training set and n_(i) elements in the standard set for component i, then at least (n_(i)−N_(A)) elements are unsampled. The problem of estimating the population frequency of unsampled elements is a nontrivial problem which is circumvented by the use of binary representation.

[0044] The foregoing completes the training phase (FIG. 2c) of invention software 15. Referring now to the lower portion of FIG. 2c, the testing phase of the invention software 15 is shown and described next.

[0045] Using the Profile to Score a Test Object

[0046] The raw score of a test object for a particular profile is the sum of the bitwise scores: $S = \log\left( \frac{p(A)}{p(I)} \right) + \sum_{k=1}^{N} S_k \quad (\text{Eq. 14})$

[0047] where $k_{ij} = j + \sum_{h=1}^{i-1} Q_h$

[0048] indexes bits. The bitwise score is

S_(k)=b_(k) P(1)_(k)+(1−b_(k)) P(0)_(k)   (Eq. 15)

[0049] where b_(k) is the value of the k^(th) bit.
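Scoring a test descriptor against a profile can be sketched as follows. The function name is hypothetical, the profile is assumed to be a list of (P(1), P(0)) pairs in bit order, and equal priors p(A)=p(I) are used as a default purely for illustration; the text does not prescribe prior values.

```python
import math
from typing import Sequence, Tuple


def raw_score(
    bits: Sequence[int],
    profile: Sequence[Tuple[float, float]],
    p_active: float = 0.5,    # assumed prior p(A); not specified in the text
    p_inactive: float = 0.5,  # assumed prior p(I); not specified in the text
) -> float:
    """Return log(p(A)/p(I)) plus the sum of bitwise scores S_k.

    For each bit, S_k is P(1)_k when the bit is 1 and P(0)_k when the bit is 0.
    """
    prior_term = math.log(p_active / p_inactive)
    bit_terms = sum(p1 if b else p0 for b, (p1, p0) in zip(bits, profile))
    return prior_term + bit_terms
```

With the toy profile `[(0.5, -0.2), (-0.1, 0.3)]` and equal priors, the descriptor `[1, 0]` picks up P(1) of the first bit and P(0) of the second.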

[0050] Maximum and Minimum Profile Scores

[0051] Given a standard set of elements for each component there exists a maximum and a minimum possible score for that component. Likewise, then, since the raw score for a profile is the sum of the component scores, there exists a maximum raw score (maxscore) and a minimum raw score (minscore) for a profile, the sums of the maximum and minimum bit scores, respectively.

[0052] Normalized Scores (nscore)

[0053] The maximum and minimum scores for a profile can vary considerably depending upon the constitution of the active and inactive sets. Similarly, the raw score of a test object for a profile can vary greatly depending upon the constitution of the training sets. Much of this variation is eliminated by expressing scores as normalized scores, referred to below as nscores. For the k^(th) test object scored against the j^(th) profile the nscore is

nscore(j,k)=[raw score(j,k)−minscore(j)]/[maxscore(j)−minscore(j)].   (Eq.16)

[0054] The nscore has a value between zero and one.
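The normalization of Eq. 16 can be sketched as follows. Two simplifications are assumptions of this sketch rather than statements of the text: the bounds are taken bit by bit (the larger and smaller of P(1) and P(0) at each bit), which may be looser than bounds taken over the standard elements of each component, and the prior term of Eq. 14 is omitted, which corresponds to equal priors. The names are hypothetical.

```python
from typing import Sequence, Tuple

Profile = Sequence[Tuple[float, float]]  # one (P(1), P(0)) pair per bit


def score_bounds(profile: Profile) -> Tuple[float, float]:
    """Return (minscore, maxscore) as sums of the per-bit minimum and maximum scores."""
    minscore = sum(min(p1, p0) for p1, p0 in profile)
    maxscore = sum(max(p1, p0) for p1, p0 in profile)
    return minscore, maxscore


def nscore(bits: Sequence[int], profile: Profile) -> float:
    """Normalize the raw bitwise score to the interval [0, 1] as in Eq. 16.

    Raises ZeroDivisionError if every bit has P(1) equal to P(0).
    """
    raw = sum(p1 if b else p0 for b, (p1, p0) in zip(bits, profile))
    minscore, maxscore = score_bounds(profile)
    return (raw - minscore) / (maxscore - minscore)
```

For the toy profile `[(0.5, -0.5), (-1.0, 1.0)]` the descriptor `[1, 0]` attains the maximum and `[0, 1]` the minimum, so their nscores are 1 and 0 respectively.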

[0055] Unbiased Scores and Variability Analysis

[0056] Any time a training object is scored against a profile trained on that object, a biased score will result. In order to obtain a score for a training object, a profile is constructed in which that object is left out of the training set, the so called “leave-one-out” method. When training sets are small, one of the best ways to evaluate the accuracy of a profile is to use the “leave-one-out” method. In particular, one can create M=N_(A)+N_(I) partial profiles by leaving out each member of the active and inactive training sets one at a time. For each bit there will then exist M values of P(1)_(i) and of P(0)_(i). These two distributions of M values will each have a mean, and a standard error of the mean. The percent standard error of the mean for P(1) and P(0) (the standard error of the mean divided by the mean) can be used to calculate the error in the raw score when a test object is scored against the complete profile. The percent error E in the raw score is $f_{Err} = \sum_{k=1}^{M} \left[ b_k E_k(1) + (1 - b_k) E_k(0) \right] \quad (\text{Eq. 17})$

[0057] where b_(k) is the k^(th) bit in the test sequence.
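The enumeration of the M=N_(A)+N_(I) partial training sets can be sketched as follows. The generator name and the tuple layout are hypothetical; each partial pair of sets would then be used to train one partial profile, against which the held-out object is scored.

```python
from typing import Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def leave_one_out(
    active: List[T], inactive: List[T]
) -> Iterator[Tuple[T, bool, List[T], List[T]]]:
    """Yield (held_out, is_active, partial_active, partial_inactive) M times.

    Each member of the active set is left out in turn while the inactive set is
    kept whole, and then each member of the inactive set is left out in turn
    while the active set is kept whole.
    """
    for i, obj in enumerate(active):
        yield obj, True, active[:i] + active[i + 1:], inactive
    for i, obj in enumerate(inactive):
        yield obj, False, active, inactive[:i] + inactive[i + 1:]
```

With two active objects and one inactive object the generator yields three splits, one per training object.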

[0058] Building a Classifier

[0059] By scoring a left-out member of a training set against the partial profile constructed using its peers, one can generate an “active” distribution of N_(A) active nscores and an “inactive” distribution of N_(I) inactive nscores. These distributions are of great utility in classifying test objects. A classifier is a function that, given an nscore for a test object, generates a value (binary or a probability) that classifies the object as either active or inactive. The active and inactive nscore distributions can be used both to assess the classification quality of the profile and to generate a probability-of-being-active for test objects. The standard statistical method of Student's t-test (one tailed, non-paired, unequal variance) can be used to obtain a probability that the active and inactive distributions are the same, the null hypothesis. To be a good classifier, the active and inactive training scores must form distinct distributions. The value

p(Good Classifier)=(1−p(null))

[0060] should be 0.9 or better if the discriminating ability of a particular profile is sufficient to function as an effective classifier.

[0061] Another common method for assessing classifier accuracy is the area under the “Receiver Operating Characteristic” (ROC) curve. A ROC curve is constructed by plotting, for each nscore value, the frequency of true-positive classifications against the frequency of false-positive classifications. Classifier accuracy can be defined as

α=2(ROC area−½).   (Eq. 18)

[0062] A value of α>0.9 is good. To construct a theoretical ROC curve it is necessary to calculate the probability of true-positive (tp) and false-positive (fp) classifications as a function of nscore: $p(tp \mid nscore \ge X) = \left( \frac{1}{\sigma_A \sqrt{2\pi}} \right) \int_{X}^{+\infty} e^{-\left( \frac{x - \mu_A}{2 \sigma_A} \right)^2}\, dx. \quad (\text{Eq. 19A})$

[0063] Similarly, the probability of a false-positive (fp) classification as a function of nscore is $p(fp \mid nscore \ge X) = \left( \frac{1}{\sigma_I \sqrt{2\pi}} \right) \int_{X}^{+\infty} e^{-\left( \frac{x - \mu_I}{2 \sigma_I} \right)^2}\, dx. \quad (\text{Eq. 19B})$

[0064] The area under the ROC curve can then be obtained by numerical integration.
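The numerical integration can be sketched as follows, under stated assumptions. The sketch takes μ and σ to be the mean and standard deviation of the active and inactive leave-one-out nscore distributions, and it uses the standard normal distribution from Python's `statistics.NormalDist` for the tail probabilities, which is an assumption of this sketch. The function names, the step count and the ±8σ integration range are hypothetical choices. The closed form used as a cross-check is the standard result for two normal distributions.

```python
from math import sqrt
from statistics import NormalDist


def roc_area(mu_a: float, sigma_a: float, mu_i: float, sigma_i: float,
             steps: int = 2000) -> float:
    """Integrate the theoretical ROC curve by sweeping the threshold X upward."""
    active, inactive = NormalDist(mu_a, sigma_a), NormalDist(mu_i, sigma_i)
    lo = min(mu_a - 8 * sigma_a, mu_i - 8 * sigma_i)
    hi = max(mu_a + 8 * sigma_a, mu_i + 8 * sigma_i)
    area = 0.0
    prev_fp, prev_tp = 1.0, 1.0  # a threshold far below both distributions
    for n in range(steps + 1):
        x = lo + (hi - lo) * n / steps
        tp = 1.0 - active.cdf(x)    # probability an active nscore is >= X
        fp = 1.0 - inactive.cdf(x)  # probability an inactive nscore is >= X
        # Trapezoid rule in the (fp, tp) plane; fp decreases as X increases.
        area += (prev_fp - fp) * (prev_tp + tp) / 2.0
        prev_fp, prev_tp = fp, tp
    return area


def roc_area_closed_form(mu_a: float, sigma_a: float,
                         mu_i: float, sigma_i: float) -> float:
    """Closed-form area for two normal distributions, used here as a cross-check."""
    return NormalDist().cdf((mu_a - mu_i) / sqrt(sigma_a ** 2 + sigma_i ** 2))
```

Identical active and inactive distributions give an area of one half (no discrimination), and the area approaches one as the two distributions separate.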

[0065] Classifying Test Objects

[0066] There are two approaches to generating a classification probability for a test object. The first and likely most accurate method is to score a test object against each of the M partial profiles in order to generate a distribution of nscores for the test object that is similar to the nscore distributions for the active and inactive sets. The t-test (i.e., single tail, two sample, independent variable) can be used to calculate the probabilities that the test object distribution is identical to the active and to the inactive distributions, respectively. The classification probability is then $p_{Active}(TestObject) = \frac{p_{Null}(TestDist, ActiveDist)}{p_{Null}(TestDist, ActiveDist) + p_{Null}(TestDist, InactiveDist)} \quad (\text{Eq. 20})$

[0067] An alternative method that is less computationally intensive involves constructing a classification curve as the ratio. Let
$p_A(nscore) = \left( \frac{1}{\sigma_A \sqrt{2\pi}} \right) \int_{-\infty}^{nscore} e^{-\left( \frac{x - \mu_A}{2 \sigma_A} \right)^2}\, dx \quad (\text{Eq. 21A})$
$p_I(nscore) = \left( \frac{1}{\sigma_I \sqrt{2\pi}} \right) \int_{nscore}^{+\infty} e^{-\left( \frac{x - \mu_I}{2 \sigma_I} \right)^2}\, dx \quad (\text{Eq. 21B})$
$p_{Active}(nscore) = \frac{p_A(nscore)}{p_A(nscore) + p_I(nscore)} \quad (\text{Eq. 22})$

[0068] To classify a test object, it is first scored once against the complete profile (none of the training set left out) to obtain an nscore value and then p_(Active)(nscore) is calculated from the curve given by eq. 22.
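Classification method 2 can be sketched as follows. As assumptions of this sketch, μ and σ are taken to be the mean and standard deviation of the active and inactive leave-one-out nscore distributions, the tail probabilities are computed with the standard normal distribution from `statistics.NormalDist`, and the function name and the numbers in the usage note are hypothetical.

```python
from statistics import NormalDist


def p_active(nscore: float, mu_a: float, sigma_a: float,
             mu_i: float, sigma_i: float) -> float:
    """Probability-of-being-active for a single nscore, following Eq. 22."""
    # Probability that an active object scores at or below this nscore.
    p_a = NormalDist(mu_a, sigma_a).cdf(nscore)
    # Probability that an inactive object scores at or above this nscore.
    p_i = 1.0 - NormalDist(mu_i, sigma_i).cdf(nscore)
    return p_a / (p_a + p_i)
```

With an active distribution centered at 0.8, an inactive distribution centered at 0.6 and both standard deviations equal to 0.05, an nscore of 0.7 lies midway between the two and gives a probability of about one half; scores well above or below the transition region give probabilities close to 1 or 0.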

[0069] While method 2 is likely less accurate than method 1 in its prediction of p_(Active) for objects that score in the transition region of the classification curve, it is generally much faster to implement than method 1. The preferred procedure when there is a large number of objects to classify is to use method 2 as an initial filter, and to reclassify those objects for which 0.05<p_(Active)<0.95 using method 1.

[0070] Estimation of Classification Error

[0071] In classification method 2, the uncertainty in the value of p_(Active) equals the uncertainty in the nscore value times the absolute value of the slope of the classification curve. Thus the values of p_(Active) are least accurate in the region of intermediate classification. Uncertainty in the nscore value has two origins. First, there is uncertainty in the horizontal position of the classification curve because there is a finite error of the mean of both the active and the inactive distributions, and secondly, there is uncertainty in the nscore value for the test object as discussed above. If the active and inactive distributions are well separated (i.e., the profile accuracy figure is greater than 0.9) then the transition region of the classification curve will be narrow and steep so that not far to either side of this region the classification curve will have a zero slope and the error in p_(Active) will vanish regardless of the size of the nscore errors (FIGS. 6 and 7).

[0072] Identification of Activity Correlated Features

[0073] Informational relative entropy is a measure of the information contained in the difference between two distributions. As such, it can also be considered to be a measure of informational significance. For a binary distribution the relative entropy is given as

H(p|q)=p ₀ log [p ₀ /q ₀ ]+p ₁ log [p ₁ / q ₁]  (Eq. 23)

[0074] where q is the reference distribution, p₁+p₀=1, and q₁+q₀=1. In the present method, distribution p is the distribution of 1's for a bit in the active set and q is the distribution of 1's for that bit in the inactive set. We therefore define the bitwise significance as

s _(ij) =p _(A)(1)_(ij) LO(1)_(ij) +p _(A)(0)_(ij) LO(0)_(ij)   (Eq. 24)

[0075] where ij indexes the j^(th) bit of the i^(th) component in the respective sets, and LO(1) and LO(0) are the log odds ratios of eqs. 9A and 9B. In order to determine which features in which components contribute most to the classification characteristics of a profile, one need only look at those features having the largest significance.
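The ranking of bits by significance can be sketched as follows. The function names are hypothetical, the natural logarithm is an assumption, and the inputs are assumed to be estimated probabilities of a 1 in the active and inactive sets, strictly between 0 and 1.

```python
import math
from typing import List, Sequence


def bit_significance(p1_active: float, p1_inactive: float) -> float:
    """Bitwise significance of Eq. 24: relative entropy of active versus inactive."""
    p0_active = 1.0 - p1_active
    p0_inactive = 1.0 - p1_inactive
    return (p1_active * math.log(p1_active / p1_inactive)
            + p0_active * math.log(p0_active / p0_inactive))


def rank_bits(p_active: Sequence[float], p_inactive: Sequence[float]) -> List[int]:
    """Return bit indices ordered from most to least significant."""
    scores = [bit_significance(a, i) for a, i in zip(p_active, p_inactive)]
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
```

A bit with the same frequency in both sets has zero significance, and the significance grows as the active and inactive frequencies diverge.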

[0076] Another embodiment of the present invention is a cyclic polypeptide that can modulate (inhibit or enhance) the activity of bone morphogenetic proteins (BMP), particularly bone morphogenetic protein-7 (BMP-7). The cyclic polypeptide is homologous to the Finger 1, Finger 2 or Heel region of bone morphogenetic protein-7, which have the following amino acid sequences: SEQ ID NO. 1 KKHELYVSFRDLGWQDWIIAPEGYAAYY (Finger 1); SEQ ID NO. 2 AFPLNSYMNATNHAIVQTLVHFINPETVPKP (Heel); and SEQ ID NO. 3 APTQLNAISVLYFDDSSNVILKKYRNMVVRACGC (Finger 2).

[0077] “Homologous” means that the cyclic polypeptide has the amino acid sequence of SEQ ID NOS. 1, 2 or 3 or a fragment thereof having at least 5, typically at least 10, more typically at least 11 and often at least 15 amino acids, provided that the polypeptide can have 1, 2, 3, 4 or 5 amino acids which differ from the wild type. The polypeptides modulate bone morphogenetic protein-7 activity. Polypeptides having the amino acid sequence of SEQ ID NOS. 4-9 are specifically excluded. Preferably, the polypeptides of the present invention are homologous to polypeptides having the amino acid sequence of SEQ ID NOS 4-9, with the aforesaid exclusion. Preferably, the polypeptides are cyclized by replacing two amino acids from the wild type sequence with cysteine and then forming a disulfide bond (e.g., a solution of 25 mg of iodine in 5 mL of 80% aqueous acetic acid with 5 mg of peptide, preferably with protected side chain functional groups).
F1-1 (5′CELYVSFRDLGWQDWIIAPEGYAAYC, SEQ ID NO.4)
F1-2 (CFRDLGWQDWIIAPC, SEQ ID NO.5)
H-1 (CAFPLNSYMNATNHAIVQTLVHFINPETVPKC, SEQ ID NO.6)
H-2C (CCFINPETVCC, SEQ ID NO.7)
F2-2 (CYFDDSSNVIC, SEQ ID NO.8)
F2-3 (CYFDDSSNVICKKYRS, SEQ ID NO.9)

[0078] The bold indicates that these cysteine residues are connected by a disulfide bond.

[0079] Suitable amino acid substitutions in Finger 1, Finger 2 and the Heel regions are determined by the computational methods described hereinabove. In particular, apply significance equation 24 to each bit of each amino acid feature vector in each protein. Take the top most significant bits of each feature vector of the amino acids in these three regions and correlate those to the features (physical properties) represented by the respective bit. Examples of the significant features ordering and corresponding features per bit are illustrated in FIGS. 3a, 3b and 4a-4e.

[0080] Physiologically acceptable salts of the polypeptides are also included.

[0081] Another embodiment of the present invention is a method of treating a subject in need of treatment which modulates (inhibits or enhances) the activity of BMP. An effective amount of the polypeptide is administered to the subject.

[0082] Polypeptides which inhibit the activity of BMP can be used to treat subjects in whom a reduction of BMP-7 activity can provide a useful therapeutic effect. Examples include pituitary abnormalities and other endocrinopathies. Also included are subjects in need of treatment with angiogenesis inhibitors (e.g., patients with cancer), with agents that reduce arteriosclerosis, and agents which prevent restenosis (e.g., patients following angioplasty).

[0083] Polypeptides which enhance the activity of BMP-7 can be used to stimulate the formation of new bone and could therefore be used to treat osteoporosis. These compounds can also enhance the functional remodeling of remaining neural tissues following neural ischemia such as stroke when used within a therapeutic time window, or to promote recovery of drug induced ischemia in the kidney and the effects of protein overload, or to ameliorate the effects of acute myocardial ischemic injury and reperfusion injury. They may be also useful in the treatment of certain types of cancer, e.g., prostate cancer and pituitary adenomas, and ameliorating the effects of chemically induced inflammatory lesion in the colon.

[0084] An “effective amount” of the peptides of the present invention is the quantity of peptide which results in a desired therapeutic and/or prophylactic effect without causing unacceptable side-effects when administered to a subject having one of the aforementioned diseases or conditions. A “desired therapeutic effect” includes one or more of the following: 1) an amelioration of the symptom(s) associated with the disease or condition; 2) a delay in the onset of symptoms associated with the disease or condition; 3) increased longevity compared with the absence of the treatment; and 4) greater quality of life compared with the absence of the treatment.

[0085] An “effective amount” of the peptide administered to a subject will also depend on the type and severity of the disease and on the characteristics of the subject, such as general health, age, sex, body weight and tolerance to drugs. The skilled artisan will be able to determine appropriate dosages depending on these and other factors. Typically, an effective amount of a peptide of the invention can range from about 0.01 mg per day to about 1000 mg per day for an adult. Preferably, the dosage ranges from about 0.1 mg per day to about 100 mg per day, more preferably from about 1.0 mg/day to about 10 mg/day.

[0086] The peptides of the present invention can, for example, be administered orally, by nasal administration, inhalation or parenterally. Parenteral administration can include, for example, systemic administration, such as by intramuscular, intravenous, subcutaneous, or intraperitoneal injection. The peptides can be administered to the subject in conjunction with an acceptable pharmaceutical carrier, diluent or excipient as part of a pharmaceutical composition for treating the diseases discussed above. Suitable pharmaceutical carriers may contain inert ingredients which do not interact with the peptide or peptide derivative. Standard pharmaceutical formulation techniques may be employed such as those described in Remington's Pharmaceutical Sciences, Mack Publishing Company, Easton, Pa.

[0087] Suitable pharmaceutical carriers for parenteral administration include, for example, sterile water, physiological saline, bacteriostatic saline (saline containing about 0.9% mg/ml benzyl alcohol), phosphate-buffered saline, Hank's solution, Ringer's-lactate and the like. Some examples of suitable excipients include lactose, dextrose, sucrose, trehalose, sorbitol, and mannitol.

[0088] A “subject” is a mammal, preferably a human, but can also be an animal, e.g., domestic animals (e.g., dogs, cats, and the like), farm animals (e.g., cows, sheep, pigs, horses, and the like) and laboratory animals (e.g., rats, mice, guinea pigs, and the like).

EXAMPLE 1

[0089] Classification of Protein Sequences by Activity

[0090] The following analogy is made between the central paradigm of the classification method and the case of protein sequences. Protein sequences are objects. A set of sequences similar enough to be aligned as a super family constitutes a collection. The aligned sequence positions are components. In this case all components have the same standard set of elements which is the 20 naturally occurring amino acids and so have the same vector width, Q. A binary vector scheme of width Q=12 is shown in Table 1. The 12 features making up the feature set are: hydrophobicity, helix propensity, sheet propensity, hydrogen donor propensity, hydrogen acceptor propensity, the state of being charged, aromaticity, sidechain linearity (unbranched), medium sidechain volume, large sidechain volume, Phi-Psi flexibility and crosslinkability (disulfide bond formation). The central paradigm requires that one assume that aligned sequence positions are independent and that features are independent.

EXAMPLE 2

[0091] Classification of Osteogenic Sequences in the TGFβ Protein Super Family-I

[0092] Table 2 is an aligned set of TGFβ super family sequences. Those with a plus sign next to them are known to be able to stimulate the formation of ectopic bone, while those with a minus sign next to them are known to be unable to form ectopic bone. In this example the active set includes BMP7, BMP6, BMP5, BMP4 and BMP2. Dpp and 60A, both known osteogenic proteins from Drosophila melanogaster, are reserved for test purposes. The inactive set includes sequences for TGFβ1, BMP3, GDF8, InhibinβA and GDF6. The results are presented in Table 3 and FIG. 2. The classifier is good, having an accuracy figure of 99.9% by the t-test and 94.8% by the ROC curve area. Using either classification method 1 or 2, the classifier correctly identifies dpp and 60A as being osteogenic with a probability greater than 99% despite the fact that their origin is an insect which has a chitin exoskeleton and no bones. Within the test set, the only other protein predicted to be a possible osteogenic molecule is UNIVIN with an osteogenic probability of 83% (method 1) and 89% (method 2).

EXAMPLE 3

[0093] Classification of Osteogenic Sequences in the TGFβ Protein Super Family-II

[0094] In this example, dpp and 60A have been added to the active training set used in example 2. The inactive set is the same as that for example 2. The results are presented in Table 4 and FIG. 7. The classifier accuracy figures of 99.94% (t-test) and 98% (ROC curve area) are improved with the addition of dpp and 60A. UNIVIN still scores in the classification transition area with a p_(Active) of 13.5% (method 1) and 39% (method 2). The effect of adding dpp and 60A to the active training set is to shift the transition zone (0.1<p_(Active)<0.9) to higher values of nscore (p_(Active)=50% occurs at an nscore of 0.67 in example 2 and at 0.695 in this example) and to narrow the zone (0.07 in example 2 versus 0.05 in this example). Thus, even though the nscore values for UNIVIN are higher in this example (0.718 versus 0.682 in example 2 using method 1, and 0.720 versus 0.696 in example 2 using method 2), it actually scores lower (13.5% using method 1 and 39% using method 2). Despite the fact that it is less likely to be an osteogenic protein, the classifier still identifies it as the most interesting member of the test set to pursue research on.

EXAMPLE 4

[0095] Identification of those Features and Residue Positions having the Largest Significance for Osteogenicity

[0096] In this example, the structure of the complete profile created in example 3 is examined to identify those features that are correlated or are anti-correlated with osteogenic activity. There are two properties of interest. First is the relative entropy of a feature where the higher the relative entropy the larger the significance, and second is the percent variation associated with the positive P value at each bit. The significance of a bit having a large relative entropy is reduced if it also has a large percent variation.
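To make the first property concrete, the sketch below ranks bits by relative entropy. It assumes the standard Kullback-Leibler definition for a single binary feature, computed between the probability of the bit being 1 in the active set and its probability in a reference set (for example, the inactive set). The exact definition used to build the profile is not restated in this paragraph, so the formula, the clamping constant `eps` and the function names are illustrative assumptions. The percent variation of each bit is not modeled here.

```python
import math


def bit_relative_entropy(p_active, p_reference, eps=1e-9):
    """Relative entropy (in bits) between two Bernoulli distributions.

    p_active:    probability that the bit is 1 in the active set.
    p_reference: probability that the bit is 1 in the reference set (assumed).
    eps clamps both probabilities away from 0 and 1 to keep the logs finite.
    """
    p = min(max(p_active, eps), 1.0 - eps)
    q = min(max(p_reference, eps), 1.0 - eps)
    return p * math.log2(p / q) + (1.0 - p) * math.log2((1.0 - p) / (1.0 - q))


def rank_bits(p_active, p_reference):
    """Return bit indices ordered from highest to lowest relative entropy."""
    scored = [
        (bit_relative_entropy(p, q), index)
        for index, (p, q) in enumerate(zip(p_active, p_reference))
    ]
    return [index for _, index in sorted(scored, reverse=True)]
```

A bit whose probability is the same in both sets has zero relative entropy and ranks last, while a bit that is nearly always set in one set and rarely in the other ranks first. Such a ranking would still need to be discounted for bits with a large percent variation, as noted above.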

[0097] While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.

1 240 1 28 PRT Homo sapiens 1 Lys Lys His Glu Leu Tyr Val Ser Phe Arg Asp Leu Gly Trp Gln Asp 1 5 10 15 Trp Ile Ile Ala Pro Glu Gly Tyr Ala Ala Tyr Tyr 20 25 2 31 PRT Homo sapiens 2 Ala Phe Pro Leu Asn Ser Tyr Met Asn Ala Thr Asn His Ala Ile Val 1 5 10 15 Gln Thr Leu Val His Phe Ile Asn Pro Glu Thr Val Pro Lys Pro 20 25 30 3 34 PRT Homo sapiens 3 Ala Pro Thr Gln Leu Asn Ala Ile Ser Val Leu Tyr Phe Asp Asp Ser 1 5 10 15 Ser Asn Val Ile Leu Lys Lys Tyr Arg Asn Met Val Val Arg Ala Cys 20 25 30 Gly Cys 4 26 PRT Homo sapiens 4 Cys Glu Leu Tyr Val Ser Phe Arg Asp Leu Gly Trp Gln Asp Trp Ile 1 5 10 15 Ile Ala Pro Glu Gly Tyr Ala Ala Tyr Cys 20 25 5 15 PRT Homo sapiens 5 Cys Phe Arg Asp Leu Gly Trp Gln Asp Trp Ile Ile Ala Pro Cys 1 5 10 15 6 32 PRT Homo sapiens 6 Cys Ala Phe Pro Leu Asn Ser Tyr Met Asn Ala Thr Asn His Ala Ile 1 5 10 15 Val Gln Thr Leu Val His Phe Ile Asn Pro Glu Thr Val Pro Lys Cys 20 25 30 7 11 PRT Homo sapiens 7 Cys Cys Phe Ile Asn Pro Glu Thr Val Cys Cys 1 5 10 8 11 PRT Homo sapiens 8 Cys Tyr Phe Asp Asp Ser Ser Asn Val Ile Cys 1 5 10 9 16 PRT Homo sapiens 9 Cys Tyr Phe Asp Asp Ser Ser Asn Val Ile Cys Lys Lys Tyr Arg Ser 1 5 10 15 10 11 PRT Homo sapiens 10 Cys Lys Lys His Glu Leu Tyr Val Ser Phe Arg 1 5 10 11 23 PRT Homo sapiens 11 Asp Leu Gly Trp Gln Asp Trp Ile Ile Ala Pro Glu Gly Tyr Ala Ala 1 5 10 15 Tyr Tyr Cys Glu Gly Glu Cys 20 12 11 PRT Homo sapiens 12 Cys Lys Lys His Glu Leu Tyr Val Ser Phe Arg 1 5 10 13 23 PRT Homo sapiens 13 Asp Leu Gly Trp Gln Asp Trp Ile Ile Ala Pro Glu Gly Tyr Ala Ala 1 5 10 15 Phe Tyr Cys Asp Gly Glu Cys 20 14 11 PRT Homo sapiens 14 Cys Arg Lys His Glu Leu Tyr Val Ser Phe Gln 1 5 10 15 23 PRT Homo sapiens 15 Asp Leu Gly Trp Gln Asp Trp Ile Ile Ala Pro Lys Gly Tyr Ala Ala 1 5 10 15 Asn Tyr Cys Asp Gly Glu Cys 20 16 11 PRT Homo sapiens 16 Cys Gln Met Gln Thr Leu Tyr Ile Asp Phe Lys 1 5 10 17 23 PRT Homo sapiens 17 Asp Leu Gly Trp His Asp Trp Ile Ile Ala Pro Glu Gly Tyr Gly Ala 1 5 10 15 Phe Tyr Cys Ser Gly Glu Cys 20 18 11 
PRT Homo sapiens 18 Cys Lys Arg His Pro Leu Tyr Val Asp Phe Ser 1 5 10 19 23 PRT Homo sapiens 19 Asp Val Gly Trp Asn Asp Trp Ile Val Ala Pro Pro Gly Tyr His Ala 1 5 10 15 Phe Tyr Cys His Gly Glu Cys 20 20 11 PRT Homo sapiens 20 Cys Arg Arg His Ser Leu Tyr Val Asp Phe Ser 1 5 10 21 23 PRT Homo sapiens 21 Asp Val Gly Trp Asn Asp Trp Ile Val Ala Pro Pro Gly Tyr Gln Ala 1 5 10 15 Phe Tyr Cys His Gly Asp Cys 20 22 11 PRT Homo sapiens 22 Cys Arg Arg His Ser Leu Tyr Val Asp Phe Ser 1 5 10 23 23 PRT Homo sapiens 23 Asp Val Gly Trp Asp Asp Trp Ile Val Ala Pro Leu Gly Tyr Asp Ala 1 5 10 15 Tyr Tyr Cys His Gly Lys Cys 20 24 11 PRT Homo sapiens 24 Cys Cys Leu Tyr Asp Leu Glu Ile Glu Phe Glu 1 5 10 25 4 PRT Homo sapiens 25 Lys Ile Gly Trp 1 26 18 PRT Homo sapiens 26 Asp Trp Ile Val Ala Pro Pro Arg Tyr Asn Ala Tyr Met Cys Arg Gly 1 5 10 15 Asp Cys 27 11 PRT Homo sapiens 27 Cys Cys Lys Lys Gln Phe Phe Val Ser Phe Lys 1 5 10 28 23 PRT Homo sapiens 28 Asp Ile Gly Trp Asn Asp Trp Ile Ile Ala Pro Ser Gly Tyr His Ala 1 5 10 15 Asn Tyr Cys Glu Gly Glu Cys 20 29 16 PRT Homo sapiens 29 Cys Cys Val Arg Gln Leu Tyr Ile Asp Phe Arg Lys Asp Leu Gly Trp 1 5 10 15 30 18 PRT Homo sapiens 30 Lys Trp Ile His Glu Pro Lys Gly Tyr His Ala Asn Phe Cys Leu Gly 1 5 10 15 Pro Cys 31 16 PRT Homo sapiens 31 Cys Cys Leu Arg Pro Leu Tyr Ile Asp Phe Lys Arg Asp Leu Gly Trp 1 5 10 15 32 18 PRT Homo sapiens 32 Lys Trp Ile His Glu Pro Lys Gly Tyr Asn Ala Asn Phe Cys Ala Gly 1 5 10 15 Ala Cys 33 11 PRT Homo sapiens 33 Cys Arg Ala Arg Arg Leu Tyr Val Ser Phe Arg 1 5 10 34 23 PRT Homo sapiens 34 Glu Val Gly Trp His Arg Trp Val Ile Ala Pro Arg Gly Phe Leu Ala 1 5 10 15 Asn Tyr Cys Gln Gly Gln Cys 20 35 11 PRT Homo sapiens 35 Cys Ser Arg Lys Ala Leu His Val Asn Phe Lys 1 5 10 36 23 PRT Homo sapiens 36 Asp Met Gly Trp Asp Asp Trp Ile Ile Ala Pro Leu Glu Tyr Glu Ala 1 5 10 15 Phe His Cys Glu Gly Leu Cys 20 37 11 PRT Homo sapiens 37 Cys Ser Arg Lys Pro Leu His Val Asn Phe Lys 1 5 10 38 23 PRT Homo sapiens 38 Glu Leu 
Gly Trp Asp Asp Trp Ile Ile Ala Pro Leu Glu Tyr Glu Ala 1 5 10 15 Tyr His Cys Glu Gly Val Cys 20 39 11 PRT Homo sapiens 39 Cys Ser Arg Lys Ser Leu His Val Asp Phe Lys 1 5 10 40 23 PRT Homo sapiens 40 Glu Leu Gly Trp Asp Asp Trp Ile Ile Ala Pro Leu Asp Tyr Glu Ala 1 5 10 15 Tyr His Cys Glu Gly Val Cys 20 41 11 PRT Homo sapiens 41 Cys Ala Arg Arg Tyr Leu Lys Val Asp Phe Ala 1 5 10 42 23 PRT Homo sapiens 42 Asp Ile Gly Trp Ser Glu Trp Ile Ile Ser Pro Lys Ser Phe Asp Ala 1 5 10 15 Tyr Tyr Cys Ser Gly Ala Cys 20 43 11 PRT Homo sapiens 43 Cys Arg Lys Val Lys Phe Gln Val Asp Phe Asn 1 5 10 44 23 PRT Homo sapiens 44 Leu Ile Gly Trp Gly Ser Trp Ile Ile Tyr Pro Lys Gln Tyr Asn Ala 1 5 10 15 Tyr Arg Cys Glu Gly Glu Cys 20 45 11 PRT Homo sapiens 45 Cys Ser Leu His Pro Phe Gln Ile Ser Phe Arg 1 5 10 46 23 PRT Homo sapiens 46 Gln Leu Gly Trp Asp His Trp Ile Ile Ala Pro Pro Phe Tyr Thr Pro 1 5 10 15 Asn Tyr Cys Lys Gly Thr Cys 20 47 8 PRT Homo sapiens 47 Ala Phe Pro Leu Asn Ser Tyr Met 1 5 48 17 PRT Homo sapiens 48 Asn Ala Thr Asn His Ala Ile Val Gln Thr Leu Val His Phe Ile Asn 1 5 10 15 Pro 49 8 PRT Homo sapiens 49 Glu Thr Val Pro Lys Pro Cys Cys 1 5 50 8 PRT Homo sapiens 50 Ser Phe Pro Leu Asn Ala His Met 1 5 51 17 PRT Homo sapiens 51 Asn Ala Thr Asn His Ala Ile Val Gln Thr Leu Val His Leu Met Phe 1 5 10 15 Pro 52 8 PRT Homo sapiens 52 Asp His Val Pro Lys Pro Cys Cys 1 5 53 8 PRT Homo sapiens 53 Ser Phe Pro Leu Asn Ala His Met 1 5 54 17 PRT Homo sapiens 54 Asn Ala Thr Asn His Ala Ile Val Gln Thr Leu Val His Leu Met Asn 1 5 10 15 Pro 55 8 PRT Homo sapiens 55 Glu Tyr Val Pro Lys Pro Cys Cys 1 5 56 8 PRT Homo sapiens 56 Asn Phe Pro Leu Asn Ala His Met 1 5 57 17 PRT Homo sapiens 57 Asn Ala Thr Asn His Ala Ile Val Gln Thr Leu Val His Leu Leu Glu 1 5 10 15 Pro 58 8 PRT Homo sapiens 58 Lys Lys Val Pro Lys Pro Cys Cys 1 5 59 8 PRT Homo sapiens 59 Pro Phe Pro Leu Ala Asp His Leu 1 5 60 17 PRT Homo sapiens 60 Asn Ser Thr Asn His Ala Ile Val Gln Thr Leu Val Asn Ser Val Asn 1 5 10 15 Ser 
61 7 PRT Homo sapiens 61 Lys Ile Pro Lys Ala Cys Cys 1 5 62 8 PRT Homo sapiens 62 Pro Phe Pro Leu Ala Asp His Leu 1 5 63 17 PRT Homo sapiens 63 Asn Ser Thr Asn His Ala Ile Val Gln Thr Leu Val Asn Ser Val Asn 1 5 10 15 Ser 64 7 PRT Homo sapiens 64 Ser Ile Pro Lys Ala Cys Cys 1 5 65 8 PRT Homo sapiens 65 Pro Phe Pro Leu Ala Asp His Phe 1 5 66 17 PRT Homo sapiens 66 Asn Ser Thr Asn His Ala Val Val Gln Thr Leu Val Asn Asn Met Asn 1 5 10 15 Pro 67 8 PRT Homo sapiens 67 Gly Lys Val Pro Lys Ala Cys Cys 1 5 68 9 PRT Homo sapiens 68 His Tyr Asn Ala His His Phe Asn Leu 1 5 69 17 PRT Homo sapiens 69 Ala Glu Thr Gly His Ser Lys Ile Met Arg Ala Ala His Lys Val Ser 1 5 10 15 Asn 70 7 PRT Homo sapiens 70 Pro Glu Ile Gly Tyr Cys Cys 1 5 71 8 PRT Homo sapiens 71 Pro Ser His Ile Ala Gly Thr Ser 1 5 72 29 PRT Homo sapiens 72 Gly Ser Ser Leu Ser Phe His Ser Thr Val Ile Asn His Tyr Arg Met 1 5 10 15 Arg Gly His Ser Pro Phe Ala Asn Leu Lys Ser Cys Cys 20 25 73 5 PRT Homo sapiens 73 Pro Tyr Ile Trp Ser 1 5 74 17 PRT Homo sapiens 74 Leu Asp Thr Gln Tyr Ser Lys Val Leu Ala Leu Tyr Asn Gln His Asn 1 5 10 15 Pro 75 8 PRT Homo sapiens 75 Gly Ala Ser Ala Ala Pro Cys Cys 1 5 76 5 PRT Homo sapiens 76 Pro Tyr Leu Trp Ser 1 5 77 17 PRT Homo sapiens 77 Ser Asp Thr Gln His Ser Arg Val Leu Ser Leu Tyr Asn Thr Ile Asn 1 5 10 15 Pro 78 8 PRT Homo sapiens 78 Glu Ala Ser Ala Ser Pro Cys Cys 1 5 79 8 PRT Homo sapiens 79 Ala Leu Pro Val Ala Leu Ser Gly 1 5 80 20 PRT Homo sapiens 80 Ser Gly Gly Pro Pro Ala Leu Asn His Ala Val Leu Arg Ala Leu Met 1 5 10 15 His Ala Ala Ala 20 81 9 PRT Homo sapiens 81 Pro Gly Ala Ala Asp Leu Pro Cys Cys 1 5 82 8 PRT Homo sapiens 82 Glu Phe Pro Leu Arg Ser His Leu 1 5 83 17 PRT Homo sapiens 83 Glu Pro Thr Asn His Ala Val Ile Gln Thr Leu Met Asn Ser Met Asp 1 5 10 15 Pro 84 8 PRT Homo sapiens 84 Glu Ser Thr Pro Pro Thr Cys Cys 1 5 85 8 PRT Homo sapiens 85 Asp Phe Pro Leu Arg Ser His Leu 1 5 86 17 PRT Homo sapiens 86 Glu Pro Thr Asn His Ala Ile Ile Gln Thr Leu Met Asn Ser Met Asp 
1 5 10 15 Pro 87 8 PRT Homo sapiens 87 Gly Ser Thr Pro Pro Ser Cys Cys 1 5 88 8 PRT Homo sapiens 88 Asp Phe Pro Leu Arg Ser His Leu 1 5 89 17 PRT Homo sapiens 89 Glu Pro Thr Asn His Ala Ile Ile Gln Thr Leu Leu Asn Ser Met Ala 1 5 10 15 Pro 90 8 PRT Homo sapiens 90 Asp Ala Ala Pro Ala Ser Cys Cys 1 5 91 8 PRT Homo sapiens 91 Gln Phe Pro Met Pro Lys Ser Leu 1 5 92 17 PRT Homo sapiens 92 Lys Pro Ser Asn His Ala Thr Ile Gln Ser Ile Val Arg Ala Val Gly 1 5 10 15 Val 93 9 PRT Homo sapiens 93 Val Pro Gly Ile Pro Glu Pro Cys Cys 1 5 94 8 PRT Homo sapiens 94 Pro Asn Pro Val Gly Glu Glu Phe 1 5 95 17 PRT Homo sapiens 95 His Pro Thr Asn His Ala Tyr Ile Gln Ser Leu Leu Lys Arg Tyr Gln 1 5 10 15 Pro 96 8 PRT Homo sapiens 96 His Arg Val Pro Ser Thr Cys Cys 1 5 97 8 PRT Homo sapiens 97 Leu Arg Val Leu Arg Asp Gly Ile 1 5 98 17 PRT Homo sapiens 98 Asn Ser Phe Asn His Ala Ile Ile Gln Asn Leu Ile Asn Gln Leu Val 1 5 10 15 Asp 99 8 PRT Homo sapiens 99 Gln Ser Val Pro Arg Pro Ser Cys 1 5 100 15 PRT Homo sapiens 100 Pro Thr Gln Leu Asn Ala Ile Ser Val Leu Tyr Phe Asp Asp Ser 1 5 10 15 101 19 PRT Homo sapiens 101 Ser Asn Val Ile Leu Lys Lys Tyr Arg Asn Met Val Val Arg Ala Cys 1 5 10 15 Gly Cys His 102 15 PRT Homo sapiens 102 Pro Thr Lys Leu Asn Ala Ile Ser Val Leu Tyr Phe Asp Asp Ser 1 5 10 15 103 19 PRT Homo sapiens 103 Ser Asn Val Ile Leu Lys Lys Tyr Arg Asn Met Val Val Arg Ser Cys 1 5 10 15 Gly Cys His 104 15 PRT Homo sapiens 104 Pro Thr Lys Leu Asn Ala Ile Ser Val Leu Tyr Phe Asp Asp Asn 1 5 10 15 105 19 PRT Homo sapiens 105 Ser Asn Val Ile Leu Lys Lys Tyr Arg Asn Met Val Val Arg Ala Cys 1 5 10 15 Gly Cys His 106 15 PRT Homo sapiens 106 Pro Thr Arg Leu Gly Ala Leu Pro Val Leu Tyr His Leu Asn Asp 1 5 10 15 107 19 PRT Homo sapiens 107 Glu Asn Val Asn Leu Lys Lys Tyr Arg Asn Met Ile Val Lys Ser Cys 1 5 10 15 Gly Cys His 108 15 PRT Homo sapiens 108 Pro Thr Glu Leu Ser Ala Ile Ser Met Leu Tyr Leu Asp Glu Asn 1 5 10 15 109 19 PRT Homo sapiens 109 Glu Lys Val Val Leu Lys Asn Tyr Gln Asp 
Met Val Val Glu Gly Cys 1 5 10 15 Gly Cys Arg 110 15 PRT Homo sapiens 110 Pro Thr Glu Leu Ser Ala Ile Ser Met Leu Tyr Leu Asp Glu Tyr 1 5 10 15 111 19 PRT Homo sapiens 111 Asp Lys Val Val Leu Lys Asn Tyr Gln Glu Met Val Val Glu Gly Cys 1 5 10 15 Gly Cys Arg 112 15 PRT Homo sapiens 112 Pro Thr Gln Leu Asp Ser Val Ala Met Leu Tyr Leu Asn Asp Gln 1 5 10 15 113 19 PRT Homo sapiens 113 Ser Thr Val Val Leu Lys Asn Tyr Gln Glu Met Thr Val Val Gly Cys 1 5 10 15 Gly Cys Arg 114 15 PRT Homo sapiens 114 Pro Thr Glu Tyr Asp Tyr Ile Lys Leu Ile Tyr Val Asn Arg Asp 1 5 10 15 115 19 PRT Homo sapiens 115 Gly Arg Val Ser Ile Ala Asn Val Asn Gly Met Ile Ala Lys Lys Cys 1 5 10 15 Gly Cys Ser 116 15 PRT Homo sapiens 116 Pro Thr Lys Leu Arg Pro Met Ser Met Leu Tyr Tyr Asp Asp Gly 1 5 10 15 117 19 PRT Homo sapiens 117 Gln Asn Ile Ile Lys Lys Asp Ile Gln Asn Met Ile Val Glu Glu Cys 1 5 10 15 Gly Cys Ser 118 15 PRT Homo sapiens 118 Pro Gln Ala Leu Glu Pro Leu Pro Ile Val Tyr Tyr Val Gly Arg 1 5 10 15 119 18 PRT Homo sapiens 119 Lys Pro Lys Val Glu Gln Leu Ser Asn Met Ile Val Arg Ser Cys Lys 1 5 10 15 Cys Ser 120 15 PRT Homo sapiens 120 Ser Gln Asp Leu Glu Pro Leu Thr Ile Leu Tyr Tyr Ile Gly Lys 1 5 10 15 121 18 PRT Homo sapiens 121 Thr Pro Lys Ile Glu Gln Leu Ser Asn Met Ile Val Lys Ser Cys Lys 1 5 10 15 Cys Ser 122 15 PRT Homo sapiens 122 Pro Ala Arg Leu Ser Pro Ile Ser Val Leu Phe Phe Asp Asn Ser 1 5 10 15 123 19 PRT Homo sapiens 123 Asp Asn Val Val Leu Arg Gln Tyr Glu Asp Met Val Val Asp Glu Cys 1 5 10 15 Gly Cys Arg 124 15 PRT Homo sapiens 124 Pro Thr Arg Leu Ser Pro Ile Ser Ile Leu Phe Ile Asp Ser Ala 1 5 10 15 125 19 PRT Homo sapiens 125 Asn Asn Val Val Tyr Lys Gln Tyr Glu Asp Met Val Val Glu Ser Cys 1 5 10 15 Gly Cys Arg 126 15 PRT Homo sapiens 126 Pro Thr Lys Leu Thr Pro Ile Ser Ile Leu Tyr Ile Asp Ala Gly 1 5 10 15 127 19 PRT Homo sapiens 127 Asn Asn Val Val Tyr Lys Gln Tyr Glu Asp Met Val Val Glu Ser Cys 1 5 10 15 Gly Cys Arg 128 15 PRT Homo sapiens 128 Pro Ala Arg Leu Ser Pro 
Ile Ser Ile Leu Tyr Ile Asp Ala Ala 1 5 10 15 129 19 PRT Homo sapiens 129 Asn Asn Val Val Tyr Lys Gln Tyr Glu Asp Met Val Val Glu Ala Cys 1 5 10 15 Gly Cys Arg 130 15 PRT Homo sapiens 130 Pro Glu Lys Met Ser Ser Leu Ser Ile Leu Phe Phe Asp Glu Asn 1 5 10 15 131 19 PRT Homo sapiens 131 Lys Asn Val Val Leu Lys Val Tyr Pro Asn Met Ile Val Glu Ser Cys 1 5 10 15 Ala Cys Arg 132 13 PRT Homo sapiens 132 Pro Val Lys Thr Lys Pro Leu Ser Met Leu Tyr Val Asp 1 5 10 133 19 PRT Homo sapiens 133 Gly Arg Val Leu Leu Asp His His Lys Asp Met Ile Val Glu Glu Cys 1 5 10 15 Gly Cys Leu 134 15 PRT Homo sapiens 134 Pro Tyr Lys Tyr Val Pro Ile Ser Val Leu Met Ile Glu Ala Asn 1 5 10 15 135 19 PRT Homo sapiens 135 Gly Ser Ile Leu Tyr Lys Glu Tyr Glu Gly Met Ile Ala Glu Ser Cys 1 5 10 15 Thr Cys Arg 136 10 PRT Homo sapiens 136 Cys Cys Cys Cys Cys Cys Cys Cys Cys Cys 1 5 10 137 10 PRT Homo sapiens 137 Lys Lys Arg Arg Lys Cys Cys Ala Ser Cys 1 5 10 138 10 PRT Homo sapiens 138 Lys Arg Arg Lys Lys Val Lys Arg Arg Arg 1 5 10 139 10 PRT Homo sapiens 139 His His His His His Arg Lys Arg Lys Tyr 1 5 10 140 10 PRT Homo sapiens 140 Glu Pro Ser Glu Glu Gln Gln Tyr Pro Pro 1 5 10 141 10 PRT Homo sapiens 141 Leu Leu Leu Leu Leu Leu Phe Leu Leu Leu 1 5 10 142 10 PRT Homo sapiens 142 Tyr Tyr Tyr Tyr Tyr Tyr Phe Lys His Thr 1 5 10 143 10 PRT Homo sapiens 143 Val Val Val Val Val Ile Val Val Val Val 1 5 10 144 10 PRT Homo sapiens 144 Ser Asp Asp Ser Ser Asp Ser Asp Asn Asp 1 5 10 145 10 PRT Homo sapiens 145 Phe Phe Phe Phe Phe Phe Phe Phe Phe Phe 1 5 10 146 10 PRT Homo sapiens 146 Arg Ser Ser Gln Arg Arg Lys Ala Lys Glu 1 5 10 147 10 PRT Homo sapiens 147 Asp Asp Asp Asp Asp Asp Asp Asp Glu Ala 1 5 10 148 10 PRT Homo sapiens 148 Leu Val Val Leu Leu Leu Ile Ile Leu Phe 1 5 10 149 10 PRT Homo sapiens 149 Gly Gly Gly Gly Gly Gly Gly Gly Gly Gly 1 5 10 150 10 PRT Homo sapiens 150 Trp Trp Trp Trp Trp Trp Trp Trp Trp Trp 1 5 10 151 10 PRT Homo sapiens 151 Gln Asn Asn Gln Gln Lys Asn Ser Asp Asp 1 5 10 152 5 PRT Homo 
sapiens 152 Asp Asp Asp Asp Asp 1 5 153 10 PRT Homo sapiens 153 Trp Trp Trp Trp Trp Trp Trp Trp Trp Trp 1 5 10 154 10 PRT Homo sapiens 154 Ile Ile Ile Ile Ile Ile Ile Ile Ile Ile 1 5 10 155 10 PRT Homo sapiens 155 Ile Val Val Ile Ile His Ile Ile Ile Ile 1 5 10 156 10 PRT Homo sapiens 156 Ala Ala Ala Ala Ala Glu Ala Ser Ala Ala 1 5 10 157 10 PRT Homo sapiens 157 Pro Pro Pro Pro Pro Pro Pro Pro Pro Pro 1 5 10 158 10 PRT Homo sapiens 158 Glu Pro Pro Lys Glu Lys Ser Lys Leu Lys 1 5 10 159 10 PRT Homo sapiens 159 Gly Gly Gly Gly Gly Gly Gly Ser Glu Arg 1 5 10 160 10 PRT Homo sapiens 160 Tyr Tyr Tyr Tyr Tyr Tyr Tyr Phe Tyr Tyr 1 5 10 161 10 PRT Homo sapiens 161 Ala His Gln Ala Ala His His Asp Glu Lys 1 5 10 162 10 PRT Homo sapiens 162 Ala Ala Ala Ala Ala Ala Ala Ala Ala Ala 1 5 10 163 10 PRT Homo sapiens 163 Tyr Phe Phe Asn Phe Asn Asn Tyr Tyr Asn 1 5 10 164 10 PRT Homo sapiens 164 Tyr Tyr Tyr Tyr Tyr Phe Tyr Tyr His Tyr 1 5 10 165 10 PRT Homo sapiens 165 Cys Cys Cys Cys Cys Cys Cys Cys Cys Cys 1 5 10 166 10 PRT Homo sapiens 166 Glu His His Asp Asp Leu Glu Ser Glu Ser 1 5 10 167 10 PRT Homo sapiens 167 Gly Gly Gly Gly Gly Gly Gly Gly Gly Gly 1 5 10 168 10 PRT Homo sapiens 168 Glu Glu Asp Glu Glu Pro Glu Ala Val Glu 1 5 10 169 10 PRT Homo sapiens 169 Cys Cys Cys Cys Cys Cys Cys Cys Cys Cys 1 5 10 170 10 PRT Homo sapiens 170 Ala Pro Pro Ser Ser Pro Pro Gln Asp Glu 1 5 10 171 10 PRT Homo sapiens 171 Phe Phe Phe Pro Phe Tyr Ser Phe Phe Phe 1 5 10 172 10 PRT Homo sapiens 172 Phe Phe Phe Phe Phe Ile His Pro Pro Val 1 5 10 173 10 PRT Homo sapiens 173 Leu Leu Leu Leu Leu Trp Ile Met Leu Phe 1 5 10 174 10 PRT Homo sapiens 174 Asn Ala Ala Asn Asn Ser Ala Pro Arg Leu 1 5 10 175 5 PRT Homo sapiens 175 Ser Asp Asp Ala Ala 1 5 176 4 PRT Homo sapiens 176 Gly Lys Ser Gln 1 177 5 PRT Homo sapiens 177 Tyr His His His His 1 5 178 4 PRT Homo sapiens 178 Thr Ser His Lys 1 179 5 PRT Homo sapiens 179 Met Leu Leu Met Met 1 5 180 10 PRT Homo sapiens 180 Asn Asn Asn Asn Asn Leu Ser Lys Glu Tyr 1 
5 10 181 10 PRT Homo sapiens 181 Ala Ser Ser Ala Ala Asp Leu Pro Pro Pro 1 5 10 182 10 PRT Homo sapiens 182 Thr Thr Thr Thr Thr Thr Ser Ser Thr His 1 5 10 183 10 PRT Homo sapiens 183 Asn Asn Asn Asn Asn Gln Glu Asn Asn Thr 1 5 10 184 10 PRT Homo sapiens 184 His His His His His Tyr His His His His 1 5 10 185 10 PRT Homo sapiens 185 Ala Ala Ala Ala Ala Ser Ser Ala Ala Leu 1 5 10 186 10 PRT Homo sapiens 186 Ile Ile Ile Ile Ile Lys Thr Thr Ile Val 1 5 10 187 9 PRT Homo sapiens 187 Val Val Val Val Val Val Ile Ile His 1 5 188 10 PRT Homo sapiens 188 Gln Gln Gln Gln Gln Leu Ile Gln Gln Gln 1 5 10 189 10 PRT Homo sapiens 189 Thr Thr Thr Thr Thr Ala Asn Ser Thr Ala 1 5 10 190 10 PRT Homo sapiens 190 Leu Leu Leu Leu Leu Leu His Ile Leu Asn 1 5 10 191 10 PRT Homo sapiens 191 Val Val Val Val Val Tyr Tyr Val Met Pro 1 5 10 192 10 PRT Homo sapiens 192 His Asn Asn His His Asn Arg Arg Asn Arg 1 5 10 193 10 PRT Homo sapiens 193 Phe Ser Ser Leu Leu Gln Met Ala Ser Gly 1 5 10 194 10 PRT Homo sapiens 194 Ile Val Val Met Met His Arg Val Met Ser 1 5 10 195 10 PRT Homo sapiens 195 Asn Asn Asn Asn Phe Asn Gly Gly Asp Ala 1 5 10 196 10 PRT Homo sapiens 196 Pro Ser Ser Pro Pro Pro His Val Pro Gly 1 5 10 197 6 PRT Homo sapiens 197 Glu Asp Gly Phe Pro Gly 1 5 198 9 PRT Homo sapiens 198 Thr Lys Ser Tyr His Ala Ala Gly Ser 1 5 199 9 PRT Homo sapiens 199 Val Ile Ile Val Val Ser Asn Ile Thr 1 5 200 9 PRT Homo sapiens 200 Pro Pro Pro Pro Pro Ala Leu Pro Pro 1 5 201 9 PRT Homo sapiens 201 Lys Lys Lys Lys Lys Ala Lys Glu Pro 1 5 202 10 PRT Homo sapiens 202 Pro Ala Ala Pro Pro Pro Ser Pro Ser Pro 1 5 10 203 10 PRT Homo sapiens 203 Cys Cys Cys Cys Cys Cys Cys Cys Cys Cys 1 5 10 204 10 PRT Homo sapiens 204 Cys Cys Cys Cys Cys Cys Cys Cys Cys Cys 1 5 10 205 10 PRT Homo sapiens 205 Ala Val Val Ala Ala Val Val Val Val Thr 1 5 10 206 10 PRT Homo sapiens 206 Pro Pro Pro Pro Pro Pro Pro Pro Pro Pro 1 5 10 207 10 PRT Homo sapiens 207 Thr Thr Thr Thr Thr Gln Thr Glu Thr Thr 1 5 10 208 10 PRT Homo sapiens 208 
Gln Glu Glu Lys Lys Ala Lys Lys Lys Lys 1 5 10 209 10 PRT Homo sapiens 209 Leu Leu Leu Leu Leu Leu Leu Met Leu Met 1 5 10 210 10 PRT Homo sapiens 210 Asn Ser Ser Asn Asn Glu Arg Ser Thr Ser 1 5 10 211 10 PRT Homo sapiens 211 Ala Ala Ala Ala Ala Pro Pro Ser Pro Pro 1 5 10 212 10 PRT Homo sapiens 212 Ile Ile Ile Ile Ile Leu Met Leu Ile Ile 1 5 10 213 10 PRT Homo sapiens 213 Ser Ser Ser Ser Ser Pro Ser Ser Ser Asn 1 5 10 214 10 PRT Homo sapiens 214 Val Met Met Val Val Ile Met Ile Ile Met 1 5 10 215 10 PRT Homo sapiens 215 Leu Leu Leu Leu Leu Val Leu Leu Leu Leu 1 5 10 216 10 PRT Homo sapiens 216 Tyr Tyr Tyr Tyr Tyr Tyr Tyr Phe Tyr Tyr 1 5 10 217 10 PRT Homo sapiens 217 Phe Leu Leu Phe Phe Tyr Tyr Phe Ile Phe 1 5 10 218 10 PRT Homo sapiens 218 Asp Asp Asp Asp Asp Val Asp Asp Asp Asn 1 5 10 219 10 PRT Homo sapiens 219 Asp Glu Glu Asp Asp Gly Asp Glu Ala Gly 1 5 10 220 5 PRT Homo sapiens 220 Ser Asn Tyr Asn Ser 1 5 221 4 PRT Homo sapiens 221 Gly Asn Gly Lys 1 222 10 PRT Homo sapiens 222 Ser Glu Asp Ser Ser Arg Gln Lys Asn Glu 1 5 10 223 10 PRT Homo sapiens 223 Asn Lys Lys Asn Asn Lys Asn Asn Asn Gln 1 5 10 224 10 PRT Homo sapiens 224 Val Val Val Val Val Pro Ile Val Val Ile 1 5 10 225 10 PRT Homo sapiens 225 Ile Val Val Ile Ile Lys Ile Val Val Ile 1 5 10 226 10 PRT Homo sapiens 226 Leu Leu Leu Leu Leu Val Lys Leu Tyr Tyr 1 5 10 227 10 PRT Homo sapiens 227 Lys Lys Lys Lys Lys Glu Lys Lys Lys Gly 1 5 10 228 10 PRT Homo sapiens 228 Lys Asn Asn Lys Lys Gln Asp Val Gln Lys 1 5 10 229 10 PRT Homo sapiens 229 Tyr Tyr Tyr Tyr Tyr Leu Ile Tyr Tyr Ile 1 5 10 230 10 PRT Homo sapiens 230 Arg Gln Gln Arg Arg Ser Gln Pro Glu Pro 1 5 10 231 10 PRT Homo sapiens 231 Asn Asp Glu Asn Asn Asn Asn Asn Asp Ala 1 5 10 232 10 PRT Homo sapiens 232 Met Met Met Met Met Met Met Met Met Met 1 5 10 233 10 PRT Homo sapiens 233 Val Val Val Val Val Ile Ile Thr Val Val 1 5 10 234 10 PRT Homo sapiens 234 Val Val Val Val Val Val Val Val Val Val 1 5 10 235 10 PRT Homo sapiens 235 Arg Glu Glu Arg Arg Arg Glu 
Glu Glu Asp 1 5 10 236 10 PRT Homo sapiens 236 Ala Gly Gly Ala Ser Ser Glu Ser Ser Arg 1 5 10 237 10 PRT Homo sapiens 237 Cys Cys Cys Cys Cys Cys Cys Cys Cys Cys 1 5 10 238 10 PRT Homo sapiens 238 Gly Gly Gly Gly Gly Lys Gly Ala Gly Gly 1 5 10 239 10 PRT Homo sapiens 239 Cys Cys Cys Cys Cys Cys Cys Cys Cys Cys 1 5 10 240 10 PRT Homo sapiens 240 His Arg Arg His His Ser Ser Arg Arg Ser 1 5 10 

What is claimed is:
 1. A method for classifying object sequences, comprising the computer implemented steps of: obtaining a set of known aligned sequences, some of which form a first class exclusive of other sequences in the set, each known sequence in the set having a respective set of n_(i) elements, different elements possessing different physical properties from a respective set of q_(i) physical properties of interest, where i is sequence alignment position; for each known sequence, forming a respective vector of q_(i) bits, a bit being set to 1 to indicate that a physical property is found in an element of the sequence and a bit being set to 0 to indicate that a physical property is absent from an element of the sequence; for each bit, defining a profile as a function of the probability of the bit being set to 1; given a test sequence to classify, forming a respective representative vector of q bits for the test sequence; assigning a score for the test sequence as a function of the defined profiles per bit and the bit values in the representative vector of the test sequence; and calculating probability of the test sequence being of the first class as a function of the assigned score.
 2. A method as claimed in claim 1 wherein the set of physical properties of interest includes hydrophobicity, helix propensity, sheet propensity, hydrogen donor propensity, hydrogen acceptor propensity, the state of being charged, aromaticity, sidechain linearity (unbranched), sidechain volume, Phi-Psi flexibility and crosslinkability.
 3. A method as claimed in claim 1 wherein the step of defining a profile includes defining probability of two terms LO(1) and LO(0) for each bit, where LO(1) is the log odds ratio of the probability of the bit being set to 1 given a sequence of the first class and the probability of the bit being set to 1 given a sequence not of the first class, and LO(0) is the log odds ratio of the probability of the bit being set to 0 given a sequence of the first class and the probability of the bit being set to 0 given a sequence not of the first class.
 4. A method as claimed in claim 3 wherein the step of assigning a score includes: for each bit in the representative vector of the test sequence, computing a bitwise score equal to (the value of the bit multiplied by the product of the probability of the bit equaling 1 in the first class and LO(1) of the corresponding bit in the representative vector of a known sequence) plus the product of (1-value of the bit) and the product of the probability of the bit equaling 0 in the first class and LO(0) of the corresponding bit in the representative vector of the known sequence.
 5. A method as claimed in claim 1 further comprising normalizing the assigned score; and the step of calculating probability includes calculating Eq. 22.
 6. A method as claimed in claim 5 wherein the step of calculating probability further includes calculating probability that distribution of the normalized score of the test sequence is equal to distribution of normalized scores for the known sequences of the first class.
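
By way of illustration only, the log odds terms of claim 3 and the bitwise score of claim 4 can be sketched as follows. The sketch assumes that every probability lies strictly between 0 and 1 (for example, after pseudocounts have been applied), that natural logarithms are used, and that the score of a test sequence is the sum of its bitwise scores. These choices and the function names are assumptions and do not limit the claims.

```python
import math


def log_odds(p_in_class, p_out_of_class):
    """Log of the ratio of an in-class probability to an out-of-class probability."""
    return math.log(p_in_class / p_out_of_class)


def bitwise_score(bit, p1_in, p1_out):
    """Score one bit of a test vector.

    bit:    value of the bit in the test vector (0 or 1).
    p1_in:  probability that the bit is 1 given a sequence of the first class.
    p1_out: probability that the bit is 1 given a sequence not of the first class.
    """
    lo1 = log_odds(p1_in, p1_out)              # LO(1)
    lo0 = log_odds(1.0 - p1_in, 1.0 - p1_out)  # LO(0)
    return bit * p1_in * lo1 + (1 - bit) * (1.0 - p1_in) * lo0


def total_score(bits, p1_in, p1_out):
    """Sum of bitwise scores over all bits (summation is an assumption)."""
    return sum(bitwise_score(b, pi, po) for b, pi, po in zip(bits, p1_in, p1_out))
```

In this sketch a bit that agrees with a feature enriched in the first class contributes a positive term, and a bit that disagrees contributes a negative term weighted by how rarely the first class lacks that feature.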